Towards End-to-End Synthetic Speech Detection

Abstract

The constant Q transform (CQT) has been shown to be one of the most effective speech signal pre-transforms to facilitate synthetic speech detection, followed by either hand-crafted (subband) constant Q cepstral coefficient (CQCC) feature extraction and a back-end binary classifier, or a deep neural network (DNN) directly for further classification. Despite the rich literature on such a pipeline, we show in this paper that the pre-transform and hand-crafted features could simply be replaced by end-to-end DNNs. Specifically, we experimentally verify that, using only standard components, a light-weight neural network can outperform the state-of-the-art methods for the ASVspoof2019 challenge. The proposed model is termed Time-domain Synthetic Speech Detection Net (TSSDNet), having ResNet- or Inception-style structures. We further demonstrate that the proposed models also have attractive generalization capability. Trained on ASVspoof2019, they achieve promising detection performance when tested on the disjoint ASVspoof2015, significantly better than existing cross-dataset results. This paper reveals the great potential of end-to-end DNNs for synthetic speech detection without hand-crafted features.

Similar resources

Towards End-to-End Speech Recognition

Standard automatic speech recognition (ASR) systems follow a divide and conquer approach to convert speech into text. Alternately, the end goal is achieved by a combination of sub-tasks, namely, feature extraction, acoustic modeling and sequence decoding, which are optimized in an independent manner. More recently, in the machine learning community deep learning approaches have emerged which al...

Tacotron: Towards End-to-End Speech Synthesis

A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module. Building these components often requires extensive domain expertise and may contain brittle design choices. In this paper, we present Tacotron, an end-to-end generative text-to-speech model that synthesizes speech directly from characters. G...

Towards End-To-End Speech Recognition with Recurrent Neural Networks

This paper presents a speech recognition system that directly transcribes audio data with text, without requiring an intermediate phonetic representation. The system is based on a combination of the deep bidirectional LSTM recurrent neural network architecture and the Connectionist Temporal Classification objective function. A modification to the objective function is introduced that trains the...

Towards Language-Universal End-to-End Speech Recognition

Building speech recognizers in multiple languages typically involves replicating a monolingual training recipe for each language, or utilizing a multi-task learning approach where models for different languages have separate output labels but share some internal parameters. In this work, we exploit recent progress in end-to-end speech recognition to create a single multilingual speech recogniti...

Towards End-to-End Lane Detection: an Instance Segmentation Approach

Modern cars are incorporating an increasing number of driver assist features, among which automatic lane keeping. The latter allows the car to properly position itself within the road lanes, which is also crucial for any subsequent lane departure or trajectory planning decision in fully autonomous cars. Traditional lane detection methods rely on a combination of highly-specialized, hand-crafted...

Journal

Journal title: IEEE Signal Processing Letters

Year: 2021

ISSN: 1558-2361, 1070-9908

DOI: https://doi.org/10.1109/lsp.2021.3089437